Add Llama NVFP4 PTQ recipes and MLP-only FP8-cast preset. by kiranbeethoju · Pull Request #1645 · NVIDIA/Model-Optimizer

kiranbeethoju · 2026-06-06T19:04:41Z

Expose huggingface/llama/ptq paths for partial and full NVFP4 on Llama 3.x, add the missing general nvfp4_mlp_only-kv_fp8_cast recipe, and cover loading in unit tests so recipe validation runs on CPU-only hosts.

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

New Features
- Added NVFP4 quantization recipes with FP8 KV-cache casting for Llama models
- Introduced MLP-only NVFP4 variant for selective layer quantization
Documentation
- Updated recipe selection guides with Llama 3.x NVFP4 configuration examples
- Added comprehensive documentation for Llama PTQ recipes, including hardware requirements and KV-calibration guidance

Expose huggingface/llama/ptq paths for partial and full NVFP4 on Llama 3.x, add the missing general nvfp4_mlp_only-kv_fp8_cast recipe, and cover loading in unit tests so recipe validation runs on CPU-only hosts.

Signed-off-by: kiranbeethoju <kiranbeethoju@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

copy-pr-bot · 2026-06-06T19:04:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-06T19:04:53Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff7f9f07-13ef-438a-a304-6196f54c5394

📥 Commits

Reviewing files that changed from the base of the PR and between 52f1ccb and 41d56ef.

📒 Files selected for processing (7)

examples/llm_ptq/README.md
modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
modelopt_recipes/huggingface/README.md
modelopt_recipes/huggingface/llama/ptq/README.md
modelopt_recipes/huggingface/llama/ptq/nvfp4_default-kv_fp8_cast.yaml
modelopt_recipes/huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
tests/unit/recipe/test_loader.py

📝 Walkthrough


This PR introduces new NVFP4 PTQ recipe configurations for Llama models, with both general and Hugging Face–specific variants combining MLP-only and default W4A4 quantization strategies with FP8 KV-cache casting, along with documentation and test coverage.

Changes

NVFP4 PTQ Recipes and Documentation

| Layer / File(s) | Summary |
| --- | --- |
| General NVFP4 MLP-only recipe with FP8 KV cast `modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml` | New general PTQ recipe that applies NVFP4 quantization only to MLP/MoE weight and input quantizers via pattern matching, uses max-based quantization, and composes FP8 KV-cache casting with disabled quantizer configurations. |
| Hugging Face Llama PTQ recipe variants `modelopt_recipes/huggingface/llama/ptq/nvfp4_default-kv_fp8_cast.yaml`, `modelopt_recipes/huggingface/llama/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml` | Two model-specific recipes for Llama: `nvfp4_default-kv_fp8_cast` applies W4A4 NVFP4 across all linear layers, and `nvfp4_mlp_only-kv_fp8_cast` restricts quantization to MLP and MoE layers; both include FP8 KV-cache casting via imported units. |
| Documentation and guidance for recipe selection `examples/llm_ptq/README.md`, `modelopt_recipes/huggingface/README.md`, `modelopt_recipes/huggingface/llama/ptq/README.md` | Example README recommends the MLP-only recipe for Llama 3.x NVFP4 quantization; Hugging Face README updated to clarify model-specific recipe discovery; new Llama PTQ README documents recipe variants, KV calibration differences, usage example, and GPU/runtime requirements. |
| Recipe loader smoke-test coverage `tests/unit/recipe/test_loader.py` | Extends `_BUILTIN_PTQ_RECIPES` test catalog with the new general `nvfp4_mlp_only-kv_fp8_cast` and two Hugging Face Llama recipe paths to ensure all variants load correctly. |
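
The "MLP-only via pattern matching" behavior described in the table can be illustrated with a minimal sketch. It is not the recipe's actual contents: the wildcard patterns, the quantizer names, and the helper `is_nvfp4_enabled` are all hypothetical, chosen only to show how glob-style matching on quantizer names could leave attention and `lm_head` quantizers untouched.

```python
from fnmatch import fnmatchcase

# Hypothetical wildcard patterns; the real recipe YAML may use different ones.
MLP_ONLY_PATTERNS = (
    "*mlp*weight_quantizer",
    "*mlp*input_quantizer",
)


def is_nvfp4_enabled(quantizer_name, patterns=MLP_ONLY_PATTERNS):
    """Return True if the quantizer name matches any MLP-only pattern."""
    return any(fnmatchcase(quantizer_name, p) for p in patterns)


# Illustrative quantizer names in the style of a Llama module tree (assumed).
names = [
    "model.layers.0.mlp.gate_proj.weight_quantizer",
    "model.layers.0.mlp.down_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
    "lm_head.weight_quantizer",
]

# Only the MLP quantizers are selected; attention and lm_head are skipped.
selected = [n for n in names if is_nvfp4_enabled(n)]
```

Under these assumed patterns, only the two `mlp` entries end up in `selected`, which mirrors the difference between the `nvfp4_mlp_only` and `nvfp4_default` variants at the level of which layers are touched.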

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1525: Adds Hugging Face NVFP4 PTQ recipes that rely on kv_fp8_cast KV-cache casting via use_constant_amax, which aligns with these recipes' FP8 KV-cache casting composition strategy.
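
As a rough illustration of why a constant amax removes the need for KV calibration, the sketch below derives a fixed scale from a preset amax and clamps values to the FP8 E4M3 finite range (±448). It is a simplification under stated assumptions: the function name `cast_with_constant_amax` is hypothetical, FP8 rounding is not modeled, and nothing here is taken from the Model-Optimizer implementation of `use_constant_amax`.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def cast_with_constant_amax(values, amax):
    """Scale and clamp values using a fixed amax (no calibration pass).

    Hypothetical helper: only scaling and clamping are modeled; the rounding
    to actual FP8 values is deliberately omitted.
    """
    if amax <= 0:
        raise ValueError("amax must be positive")
    scale = amax / FP8_E4M3_MAX  # fixed ahead of time, independent of the data
    dequantized = []
    for v in values:
        q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))  # clamp to range
        dequantized.append(q * scale)
    return scale, dequantized


# With amax equal to the FP8 maximum, the scale is 1 and only outliers clamp.
scale, out = cast_with_constant_amax([1.0, -600.0, 500.0], amax=448.0)
```

Because the scale depends only on the preset amax, no activation statistics need to be collected, at the cost of clamping any value whose magnitude exceeds that amax.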

Suggested reviewers

ChenhanYu
jenchen13
yueshen2016

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding Llama NVFP4 PTQ recipes and an MLP-only FP8-cast preset, which aligns with the file changes across documentation and recipe configurations. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | PR adds YAML recipes, docs, and test updates. No unsafe torch.load, numpy.load, trust_remote_code, eval/exec, nosec comments, or new dependencies detected. |


kiranbeethoju requested review from a team as code owners June 6, 2026 19:04

kiranbeethoju requested a review from cjluo-nv June 6, 2026 19:04

coderabbitai Bot approved these changes Jun 6, 2026


kiranbeethoju closed this Jun 6, 2026


kiranbeethoju wants to merge 1 commit into NVIDIA:main from kiranbeethoju:feat/llama-nvfp4-ptq-recipes
